Abstract


Argumentative Zoning (AZ) is an anal- 
ysis of the argumentative and rhetorical 
structure of a scientific paper. It has been
shown that independent human coders can apply
it reliably, and it has proven useful for
various information access tasks. Annotation
experiments have, however, so far been
restricted to one discipline, computational
linguistics (CL). Here, we present a more
informative AZ scheme with 15 categories 
in place of the original 7, and show that 
it can be applied to the life sciences as 
well as to CL. We use a domain expert 
to encode basic knowledge about the sub- 
ject (such as terminology and domain spe- 
cific rules for individual categories) as part 
of the annotation guidelines. Our results 
show that non-expert human coders can 
then use these guidelines to reliably an- 
notate this scheme in two domains, chem- 
istry and computational linguistics. 


1 Introduction 


Teufel et al. (1999) define the task of Argumenta- 
tive Zoning (AZ) as a sentence-by-sentence clas- 
sification with mutually exclusive categories from 
the annotation scheme given in Fig. 1. The reason- 
ing behind the categories is inspired by the notion 
of a knowledge claim (Myers, 1992; Luukkonen, 
1992): the act of writing a paper corresponds to
an attempt to claim ownership of a new piece
of knowledge, which is to be integrated into the
repository of scientific knowledge in the authors’
field by the process of peer review and publication.
In the course of this process, the authors
have to convince the reviewers that the knowledge
claim of the paper is valid (Swales, 1990; Hyland,
1998). What AZ aims to model, then, are
some of the relevant stages in this argument. We
divide the paper into zones, OTHER, OWN and
BACKGROUND. These are defined on the basis 
of who owns the knowledge claim in the corre- 
sponding segment. There are also two categories 
which are defined by their relationship to existing 
work, BASIS and CONTRAST. That means that 
parts of the AZ scheme are similar to citation func- 
tion classification schemes from the area of cita- 
tion content analysis (Garfield, 1965; Weinstock, 
1971; Spiegel-Rüsing, 1977), and to automatic
citation function classification (Nanba and Oku- 
mura, 1999; Garzone and Mercer, 2000; Teufel 
et al., 2006). The remaining categories, AIM and 
TEXTUAL, fulfil different rhetorical functions for 
the presentation of the paper. AIM points out the 
paper’s main knowledge claim, a rhetorical move 
which may be repeated in the conclusion and the 
introduction. TEXTUAL explains the physical
location of information, e.g., by giving a section
overview or presenting a summary of a subsection.
On the basis of human-annotated training
material, AZ categories can be assigned
automatically using supervised machine learning.


Category | Description
AIM | Statement of research goal.
BACKGROUND | Description of generally accepted background knowledge.
BASIS | Existing KC provides basis for new KC.
CONTRAST | An existing KC is contrasted, compared, or presented as weak.
OTHER | Description of existing KC.
OWN | Description of any other aspect of new KC.
TEXTUAL | Indication of paper’s textual structure.


Figure 1: AZ Annotation Scheme (Teufel et al.,
1999).


Rhetorical information marking is useful for
many novel information access tasks. For instance,
information retrieval can profit from
rhetorical information in the form of paradigm
shift statements (Chichester et al., 2005), as papers
containing such statements have a high impact in
an area. 75% of the “Faculty of 1000 Biology”
papers (which are chosen by experts for their
special importance) contain paradigm shift sentences
(Agnes Sandor, personal communication).


AZ annotation allows the construction of multi- 
and single document summaries which concen- 
trate on differences and similarities to related 
(cited) work. AZ can also be used for search in
a database of scientific articles, in particular for
enhanced citation indexing. This has previously
been explored in a task-based evaluation, where
users were asked to list positive and negative
citations they would expect in a paper, given a short
extract (Teufel, 2001). In that task, AZ-based
extracts outperformed other document surrogates.


Feltrim et al. (2005) present a writing support 
system which analyses students’ drafts of sum- 
maries for their PhD theses, performs an AZ anal- 
ysis on them and critiques the rhetorical structure 
of the students’ draft on the basis of it. 


The definition of the AZ categories is based 
on rhetorical principles and should be decidable, 
in principle, without specific domain knowledge 
about what is discussed in detail in the paper. We 
present here the first evidence that AZ categories 
can be reliably recognised across scientific disci- 
plines, using chemistry and computational linguis- 
tics as our model disciplines for these experiments. 


The categories just introduced are abstract and 
depend on the annotators’ interpretation of a 
rhetorical argument. This means that there is 
no guarantee that several independent annotators 
would annotate similarly. It is therefore crucial 
that all annotations at a high level of interpreta- 
tion are backed up by human annotation with more 
than one annotator. However, work on citation
function classification typically uses only the
untested annotation of a single human annotator,
usually the designer of the scheme, as gold
standard (Spiegel-Rüsing, 1977; Weinstock, 1971;
Nanba and Okumura, 1999; Garzone and Mercer,
2000). Teufel et al. (2006) are the only exception:
they test their citation function scheme using
modern corpus-linguistic annotation methodology.


A study of human agreement on AZ annotation 
exists (Teufel et al., 1999), but this uses articles
from only one discipline, namely computational
linguistics. In this paper, we use a similar method- 
ology to Teufel et al., but with data from two disci- 
plines. The preliminary conclusion from these ex- 
periments is that annotation with chemistry papers 
has resulted in higher agreement than annotation 
with computational linguistics papers. 

We extend the AZ annotation scheme to make 
further distinctions, as will be discussed in sec- 
tion 2. We also created an environment in which 
domain knowledge that an annotator might have 
about the science in a paper is systematically dis- 
regarded. We will describe how this was done in 
section 3, and then present the annotation experi- 
ment itself in section 4. 


2 Changes to the AZ Scheme 


Argumentative Zoning II (AZ-II) is a new annotation
scheme, which is an elaboration of the original
AZ scheme. It is presented in Fig. 2. Our
annotation guidelines are 111 sides of A4 and contain
a decision tree, detailed descriptions of the
semantics of the 15 categories, 75 rules for pairwise
distinction of the categories and copious examples
from both chemistry and computational linguistics.
During guideline development, 70 chemistry
papers and 20 CL papers were used, which are
distinct from the ones used for annotation. It took 3
months of part-time work to prepare the guidelines
for CL, and substantially less time to adapt them
for chemistry. We have made them available at
www.cl.cam.ac.uk/research/nl/sciborg.

The differences between the original AZ and 
AZ-II are as follows: 


• Category AIM remained the same.

• Category BACKGROUND was renamed
CO_GRO, or common ground.

• Category OTHER was split into other people’s
work (OTHR) and the authors’ own previous
work (PREV_OWN).

• Category BASIS was split into usage (USE)
and support (SUPPORT).

• Category CONTRAST was split into neutral
comparison (CODI), contradiction
(ANTISUPP), and a category combining
research gaps with criticism (GAP_WEAK).

• Category OWN was split into description of
method (OWN_MTHD), results (OWN_RES)
and conclusions (OWN_CONC), and a category
which specifies recoverable errors made
by the authors (OWN_FAIL).

• Category TEXTUAL was discontinued, because
it is less informative than the other categories.

• Two new categories were introduced,
NOV_ADV (advantages of the new knowledge
claim) and FUT (declaration of
limitations or future work).


Category | Description
AIM | Statement of specific research goal, or hypothesis of current paper
NOV_ADV | Novelty or advantage of own approach
CO_GRO | No knowledge claim is raised (or knowledge claim not significant for the paper)
OTHR | Knowledge claim (significant for paper) held by somebody else. Neutral description
PREV_OWN | Knowledge claim (significant) held by authors in a previous paper. Neutral description.
OWN_MTHD | New knowledge claim, own work: methods
OWN_FAIL | A solution/method/experiment in the paper that did not work
OWN_RES | Measurable/objective outcome of own work
OWN_CONC | Findings, conclusions (non-measurable) of own work
CODI | Comparison, contrast, difference to other solution (neutral)
GAP_WEAK | Lack of solution in field, problem with other solutions
ANTISUPP | Clash with somebody else’s results or theory; superiority of own work
SUPPORT | Other work supports current work or is supported by current work
USE | Other work is used in own work
FUT | Statements/suggestions about future work (own or general)


Figure 2: AZ-II Annotation Scheme.


Our AZ-II categories are more fine-grained than
the original AZ categories. The reasons for this are
twofold: to bring AZ closer to contemporary citation
function schemes, and to incorporate distinctions
recently found useful by other researchers.
For instance, Chichester et al. (2005) argue that 
ANTISUPP is particularly important. The finer 
grain in AZ-II has been accomplished purely by 
splitting existing AZ categories; hence, the coarser 
AZ categories are recoverable (with the exception 
of the TEXTUAL category). Annotation examples 
are given in the appendix. 

As in AZ, citations are an important but not nec- 
essarily decisive cue for a sentence to belong to 
a particular zone. The guidelines mention cita- 
tions as one factor in deciding whether a knowl- 
edge claim holds, and citations occur in several 
examples, so it is likely that the presence of ci- 
tations would have influenced annotators in their 
decision. 

Of the changes, the distinction which is likely 
to have the greatest impact on the annotation is 
the split of OWN according to the stage of the
authors’ problem-solving process, namely into methods,
results, conclusions or local failure. In most life
sciences, descriptions of research as a problem-solving
process are a dominant phenomenon, whereby
problem-solving descriptions can be of differing
length and embeddedness. For instance, in syn- 
thetic chemistry, the starting compound for the 
main synthesis in the paper may first have to be 
synthesised itself (if it is not commercially avail- 
able, for instance). In that case, arriving at the 
compound is an intermediate, smaller problem- 
solving process which enables the larger problem- 
solving process that represents the new KC. 


The original AZ scheme did not mark this distinction,
possibly because it is not as easily observable
in CL as it is in the life sciences, and
because problem-solving stages were not part of 
the main analytic interest of AZ, which focused 
on how scientific argumentation is related to de- 
scriptions of own and other work. Also, neither of 
the traditional AZ applications (summarisation or 
citation indexing) had any direct use for the subdi- 
vided categories. But in the life sciences, there 
are applications which would make use of such 
a subdivision. For instance, in chemistry there 
is a niche for search applications which guide 
searchers directly to the method and/or result
sections in papers. Specifically, the OWN_FAIL
category is motivated by the failure-and-recovery
search. In text, OWN_FAIL marks cases where the
authors helpfully mention in passing steps which 
were found not to work during a long synthetic 
procedure (often the ‘total synthesis’ of a com- 
pound which is found in nature). Such cases hap- 
pen frequently, and are generally followed by a 
‘recovery’ statement which explains how the problem
can be avoided. Another possible application
that calls for a subdivision is Feltrim et al.’s
(2005) rhetorical writing system for novice writers.
It trains novices in writing rhetorically
well-formed abstracts and therefore must have a way of
distinguishing, for instance, between methods and 
results. 

Note that several of the applications based on 
AZ and AZ-II in general rely on the rare categories 
much more than they rely on the more frequent 
categories. OWN_FAIL is an example of a rare but 
important category, and so is AIM, which is central 
to summarisation applications. The comparative 
and contrastive categories CODI, ANTISUPP and
GAP_WEAK, on the other hand, are particularly 
useful to citation-based search applications. 

Other AZ-like schemes for scientific discourse 
created for the biomedical domain (Mizuta and 
Collier, 2004) and for computer science (Feltrim 
et al., 2005) also made the decision to subdivide 
OWN, in similar ways to how we propose here. 
The current work, however, is the first experimen- 
tal proof that humans can make this distinction — 
and the others encoded in AZ-II — reliably, and in 
two quite distinct disciplines. 


3 Discipline-Independent Non-Expert 
Annotation 


An important principle of AZ is that its categories 
can be decided without domain knowledge. This 
rule is anchored in the guidelines: when choosing 
a category, no reasoning about the scientific facts 
is allowed. The avoidance of domain-knowledge 
has its motivation in a strategy for a hypotheti- 
cal automatic text-understanding system for unre- 
stricted texts. Given the state of the art in text pro- 
cessing and knowledge representation, text under- 
standing systems should in our opinion use gen- 
eral, rhetorical, and logical aspects of the text, 
rather than attempting to recognise or represent the 
scientific knowledge contained in the text. What 
the human annotation — the gold standard — should 
then do is to simulate the best possible output that 
such a system could theoretically create. 

Annotators may use only general, rhetorical or 
linguistic knowledge; knowledge which is shared 
by all proficient speakers of a language. The 
guidelines spell out what is meant by these general 
principles. For instance, one can use lexical and 
syntactic parallelism in a text to infer that the au- 
thors were setting up a comparison between them- 
selves and some other approach. 

There is, however, a problem with annotator
expertise and with the exact implementation of the
“no domain knowledge” principle. This problem 
does not become apparent until one starts work- 
ing with disciplines where at least some of the an- 
notators or guideline developers are not domain 
experts (chemistry, in our case). Domain experts 
naturally use scientific knowledge and inference 
when they make annotation decisions. It would 
be unrealistic to expect them to be able to disre- 
gard their domain knowledge simply because they 
were instructed to do so. Additionally, when all 
annotators/scheme developers are domain experts, 
it is hard to even notice the cases where they “ac- 
cidentally” use domain knowledge during anno- 
tation. We therefore artificially created a situa- 
tion where all annotators are “semi-informed non- 
experts”, which forces them to comply with the 
principle, namely by the following rules: 


Justification: Annotators have to justify all an- 
notation decisions by pointing to some text-based 
evidence, and by giving the section heading in the 
guidelines that describes the particular reason for 
assigning the category. General discipline-specific 
knowledge an annotator may happen to have is ex- 
cluded as justification. Annotators’ justifications 
have to be typed into the annotation tool and are 
open to challenge during the training phase. Much 
of the allowable justification comes in the form 
of general and linguistic principles, e.g., an ex- 
plicit cue phrase, the title, or the structural simi- 
larity of textual strings. For instance, annotators 
are allowed to infer that process-VPs in the title 
are likely to be the contribution (knowledge of the 
actual concrete contribution of a paper is a require- 
ment for annotation of AIM). 


Discipline-specific Generics: The guidelines 
contain a section with high-level facts about the 
general research practices in the discipline. These 
generics constitute the only scientific knowledge 
which is acceptable as a justification, and are
intended to help non-expert annotators recognise
how a paper might relate to already established 
scientific knowledge, so that they will be able 
to avoid common mistakes about the knowledge 
claim status of a certain fact. For instance, the bet- 
ter they are able to distinguish what is commonly 
known from what is newly claimed by the authors, 
the more consistent their annotation will be. 


Annotation with expert-trained non-expert annotators
means that a domain expert must be available
initially, during the development of the annotation
scheme and the guidelines, either as a
co-developer or as an informant. The domain expert’s
job is to describe scientific knowledge in that do- 
main in a general way, in as far as it is neces- 
sary for the scheme’s distinctions, and to write the 
domain-specific rules for the individual categories, 
including the choice of example sentences. This 
means that the guidelines are split into a domain- 
general and a domain-specific part. 


The discipline-specific generics in chemistry 
come in the form of a “chemistry primer”, a 10- 
page collection of high-level scientific domain 
knowledge. It contains: a glossary of words a non- 
chemist would not have heard about or would not 
necessarily recognise as chemical terminology; a 
list of possible types of experiments performed 
in chemistry; a list of commonly used machinery;
a list of non-obvious negative characterisations
of experiments and compounds (“sluggish”,
“inert”); and a list of possible types of knowledge
claims. For instance, in chemistry each chemical
substance mentioned can in principle have a
knowledge claim associated with its discovery or
invention (with the exception of water, rock salt,
the metals known in prehistory and a few others).
If, however, a compound or process is considered to
be so commonly used that it is in the “general
domain” (e.g., “the Stern–Volmer equation” or “the
Grignard reaction”), it is no longer associated with
somebody’s knowledge claim, and as a result its
usage is not to be marked with category USE.


Descriptions of individual categories can have 
domain-specific subsections, as well as the gen- 
eral ones. For instance, if the text states that the 
authors could not replicate a published result, the 
guidelines describe the cases when this is the au- 
thors’ fault (OWN_FAIL) in contrast to the cases 
where this is an indirect accusation of the previ- 
ous experiment (ANTISUPP). 


Another potentially unclear distinction is 
between results (OWN_RES) and conclusions 
(OWN_CONC). The difference is defined on 
the basis of how much reasoning is necessary 
to be able to make the statement concerned. If 
all the authors did was to read a measurement 
off an instrument, the label OWN_RES applies. 
Reasoning points to OWN_CONC; it is sometimes
linguistically marked (“therefore”, “we
conclude”, “this means that”), but in many cases,
domain knowledge may be required to decide
whether reasoning was necessary to make a
certain statement. Possible OWN_RES statements,
according to the chemistry primer, include: state- 
ments of simple numerical result; descriptions of 
graphs; descriptions of atoms’ positions in three- 
dimensional space; statements of trends, unless 
a reason for these results is given; comparisons 
of results of more than one experiment, unless a 
reason for these results is given. 

The chemistry primer also lists phenomena 
which in a typical experiment would be read off 
chemical machinery (e.g., “Stark effect”). This list 
gives the non-expert annotator an objective crite- 
rion to answer the question how likely it is that a 
certain statement by the authors was arrived at by 
inference. We also found that our list of phenom- 
ena which can be read off machinery, which was 
compiled from the first 30 papers, generalised well 
to the other 40 papers considered. 

The chemistry primer is not an attempt to sum- 
marise all methods and experimentation types in 
chemistry; this would be impossible to do, cer- 
tainly in a few pages. Rather, it tries to answer 
many of the high-level questions a non-expert 
would ask an expert, in the framework of AZ.

This methodology allows us to hire expert and
non-expert annotators and bring them in line with
each other. We believe it could be expanded
relatively easily into many other disciplines, using
domain experts who create similar primers for
genetics, experimental physics or cell biology, but
re-using the bulk of the guidelines.


4 Annotation Experiments 


The annotators were the co-developers of the an- 
notation scheme and the authors of this paper. 
Whereas all three annotators have good back- 
ground knowledge in CL, the largest difference be- 
tween them concerns their expertise in chemistry: 
Annotator A is a PhD-level chemist, Annotator B 
has two years of undergraduate training in chemistry
and can therefore be considered a chemical
semi-expert, and Annotator C has no specialised 
chemistry knowledge. 

As agreement measure we choose the Kappa
coefficient κ (Fleiss, 1971; Siegel and Castellan,
1988), the agreement measure predominantly used
in natural language processing research (Carletta,
1996). κ corrects raw agreement P(A) for
agreement by chance P(E):


$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

No matter how many items or annotators, or
how the categories are distributed, κ = 0 when
there is no agreement other than what would be
expected by chance, and κ = 1 when agreement
is perfect. If two annotators agree less than
expected by chance, κ can also be negative. Chance
agreement P(E) is defined as the level of agreement
which would be reached by random annotation
using the same distribution of categories as
the real annotators. All work done here is reported
in terms of Fleiss’ κ.¹ κ is also designed to
abstract over the number of annotators as its formula
relies on the proportion of expected vs. observed
pairwise agreements possible in a pool. That is,
κ for k annotators will be an average of the values
of κ taking all possible m-tuples of annotators
from the annotator pool (with m < k). As a
side effect of its definition of random agreement,
κ treats agreement in a rare category as more
surprising, and rewards such agreement more than an
agreement in a frequent category. This is a desirable
property, because we are more interested in
the performance of the rare rhetorical categories
than we are in the performance of the more
frequent zone categories.
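
As an illustration only, the following Python sketch computes κ from raw labels along the lines described above: P(A) is the mean proportion of agreeing annotator pairs per sentence, and P(E) is derived from the pooled category distribution of all annotators (the Fleiss variant). The function name `fleiss_kappa` and the toy labels are hypothetical and are not taken from the annotation tool used in the experiments.

```python
from collections import Counter


def fleiss_kappa(items):
    """Fleiss' kappa for `items`, a list of per-sentence label lists.

    Each inner list holds one category label per annotator; every
    sentence must be labelled by the same number of annotators (>= 2).
    """
    n_items = len(items)
    n_raters = len(items[0])
    pair_count = n_raters * (n_raters - 1)

    pooled = Counter()     # category counts over all annotators and sentences
    agreement_sum = 0.0    # sum of per-sentence pairwise agreement
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / pair_count

    p_a = agreement_sum / n_items                         # observed agreement P(A)
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in pooled.values())  # chance agreement P(E)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_a - p_e) / (1.0 - p_e)


# Toy data (hypothetical): three annotators, four sentences.
toy = [
    ["AIM", "AIM", "AIM"],
    ["OWN_MTHD", "OWN_MTHD", "OWN_RES"],
    ["OWN_RES", "OWN_RES", "OWN_RES"],
    ["OTHR", "OTHR", "OTHR"],
]
print(round(fleiss_kappa(toy), 3))
```

Because P(E) is built from squared category proportions, agreement on a rarely used label contributes little to chance agreement, which is the property the text above describes as rewarding agreement in rare categories.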


4.1 Data 


For chemistry, 30 random-sampled papers from 
journals published between 2004 and 2007 by the 
Royal Society of Chemistry were used for
annotation². The papers cover all areas of chemistry
and some areas close to chemistry, such as climate 
modelling, process engineering, and a double- 
blind medical trial. The data used for the exper- 
iment contains a total of 3745 sentences. 

For computational linguistics, 9 papers were an- 
notated, with a total of 1629 sentences. The papers 
were published between 1998 and 2001 at ACL, 
EACL or EMNLP conferences, and were taken 
from the Computation and Language archive. 
Both chemistry and CL papers were automatically 
sentence-split, with manual correction of errors; 
acknowledgement sections were disregarded. A
web-based annotation tool was used for guideline
definition and for annotation.


¹ Artstein and Poesio (2008) observe that there are several
versions of κ which differ in how P(E) is calculated. In
particular, Fleiss’ (1971) κ calculates P(E) as the average
observed distribution of all annotators, whereas Cohen’s (1960)
κ calculates P(E) only on the basis of the other annotator(s).

² 100 papers across a spread of disciplines from the January
2004 issues of the RSC were selected blindly (but with
an attempt to cover most areas of chemistry). 30 out of these
were random sampled for annotation; the rest were used for
annotation development.


Category | Chem | CL
OWN_MTHD | 25.4 | 55.6
OWN_RES | 24.0 | 5.6
OWN_CONC | 15.1 | 10.7
OTHR | 8.3 | 10.0
USE | 7.9 | 2.7
CO_GRO | 6.7 | 5.7
PREV_OWN | 3.4 | 1.7
AIM | 2.3 | 1.8
SUPPORT | 1.5 | 0.7
GAP_WEAK | |
FUT | |
NOV_ADV | |
CODI | |
OWN_FAIL | 0.8 | 0.1
ANTISUPP | 0.5 | 0.6


Figure 3: Frequency of AZ-II Categories (in %).

Our choice of which data sets to use was
affected by the relative length of papers more than
by the journal/conference distinction. Average 
article length between chemistry journal articles 
(3650 words/paper) and CL conference articles 
(4219 words/paper) is comparable, so conference 
articles in CL seem a much better choice for com- 
parative work than journal publications, which are 
often very long in CL. Additionally, conferences 
have a high profile in CL, and we found the con- 
ference publications to be of high editorial quality. 
We are nevertheless interested in the structure of 
longer journal articles, and plan to investigate CL 
journals in the future. 

The annotations were done using a web-based 
annotation tool. Every sentence is assigned a cat- 
egory. No communication between the annotators 
was allowed. 


4.2 Results 


The inter-annotator agreement for chemistry
was κ = 0.71 (N=3745, n=15, k=3). For CL,
the inter-annotator agreement was κ = 0.65
(N=1629, n=15, k=3). For comparison, the
inter-annotator agreement for the original,
CL-specific AZ with 7 categories was κ = 0.71
(N=3420, n=7, k=3). Given the subjective nature
of the task and the fact that AZ-II introduces ad- 
ditional distinctions, the AZ-II agreement can be 
considered acceptable for CL and relatively high 
for chemistry. Additionally, chemistry annota- 
tion used one non-expert annotator, who had no 
chemistry-specific domain knowledge apart from 
that in the chemistry primer. 

The distribution of categories for the two disci- 
plines is given in Fig. 3. As expected, there is a 
large discrepancy in frequency between the (rare) 
rhetorical categories and the (much more
frequent) zone categories OWN_MTHD, OWN_RES,
OWN_CONC, OTHR and CO_GRO. For supervised
learning, too few examples of any category can be 
a problem. There are methods which attempt to re- 
duce the annotation effort by using a trained clas- 
sifier to suggest possible cases to a human. How- 
ever, the classifier can only find examples similar 
to the ones that have already been manually clas- 
sified, when the real problem is a recall-problem, 
i.e., the challenge is to find more new examples in 
the multitude of possible sentences. To solve this 
in a fundamentally sound way, there seems to be 
no other way than to annotate more texts, at the 
cost of more human effort. 

If we consider the differences across disciplines,
the most striking ones concern the distribution
of OWN_MTHD, which is more than twice
as common in CL (56% vs 25%), and OWN_RES,
which is far more common in chemistry overall
(24% vs 5.6%). Usage of other people’s knowledge
claims or materials also seems to be more
common in chemistry, or at least more explicitly
expressed (7.9% vs 2.7%). With respect to the
shorter, rarer categories, there is a marked
difference in OWN_FAIL (0.1% in CL, but 0.8% in
chemistry³) and SUPPORT, which is more common
in chemistry (1.5% vs 0.7%). However, this effect
is not present for ANTISUPP (contradiction of
results), the “reverse” category to SUPPORT (0.6%
in CL vs 0.5% in chemistry).

As far as the chemistry annotation is concerned,
it is interesting to find out whether Annotator A
was influenced during annotation by domain
knowledge which Annotator C did not have, and
Annotator B had to a lower degree⁴. We therefore
calculated pairwise agreement, which was
κ_AC = 0.66, κ_BC = 0.73 and κ_AB = 0.73 (all:
N=3745, n=15, k=2). That means that the largest
disagreements were between the non-expert (C)
and the expert (A), though the differences are
modest. This might indicate that Annotators
A and B used a certain amount
of domain knowledge which the chemistry primer
in the guidelines does not yet, but should, cover.
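
The pairwise figures above can be obtained by restricting the κ computation to two annotators at a time. The sketch below is a minimal illustration under stated assumptions: the helper names (`fleiss_kappa`, `pairwise_kappa`) and the toy label sequences are hypothetical, and the labels for annotators A, B and C are invented, not data from the experiment.

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(items):
    """Fleiss' kappa; `items` is a list of per-sentence label lists."""
    n_raters = len(items[0])
    pooled = Counter()
    agreement_sum = 0.0
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )
    p_a = agreement_sum / len(items)
    total = len(items) * n_raters
    p_e = sum((c / total) ** 2 for c in pooled.values())
    return (p_a - p_e) / (1.0 - p_e)


def pairwise_kappa(labels_by_annotator):
    """Return kappa for every pair of annotators.

    `labels_by_annotator` maps an annotator name to a list of labels,
    aligned so that position i refers to the same sentence for everyone.
    """
    result = {}
    for a, b in combinations(sorted(labels_by_annotator), 2):
        paired = [list(pair) for pair in
                  zip(labels_by_annotator[a], labels_by_annotator[b])]
        result[(a, b)] = fleiss_kappa(paired)
    return result


# Invented labels for four sentences (not the experimental data).
toy = {
    "A": ["AIM", "USE", "USE", "FUT"],
    "B": ["AIM", "USE", "USE", "FUT"],
    "C": ["AIM", "USE", "FUT", "USE"],
}
for pair, value in pairwise_kappa(toy).items():
    print(pair, round(value, 2))
```

Comparing the pair that contains the expert and the non-expert with the other pairs is what allows the text above to attribute part of the disagreement to domain knowledge.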

In an attempt to determine how well categories
are defined, we first consider the binary
distinction between zone categories (OWN_MTHD,
OWN_RES, OWN_CONC, OWN_FAIL, OTHR,
PREV_OWN and CO_GRO) and rhetorical categories
(the other 8). This shows an inter-annotator
agreement of κ_binary = 0.78 (N=3745, n=2, k=3)
for chemistry and κ_binary = 0.65 (N=1629, n=2,
k=3) for CL, indicating that annotators find it
relatively easy (chemistry) or at least not more
difficult than the overall distinction (CL) to
distinguish these two types of categories. We next
perform Krippendorff’s (1980) category distinctions
(Fig. 4). Here, all categories apart from the one
diagnosed are collapsed, and what is reported is
the difference of inter-annotator agreement when
compared to the overall distinctiveness (κ=0.71
for chemistry, κ=0.65 for CL). Where the difference
is positive, the annotators could distinguish
the given category better than they could
distinguish all categories, and where it is negative,
correspondingly worse.⁵

³ These are not large differences in absolute terms (55
items identified as OWN_FAIL by at least one annotator in
chemistry, vs. 7 such items in CL), but the relative difference is
large and confirms that in chemistry papers, particularly
descriptions of synthesis procedures, OWN_FAIL cases appear
relatively frequently.

⁴ This question does not arise in the case of CL, as all
annotators can be considered experts in this respect.
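
The category diagnostic just described can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the code used for Fig. 4: every label other than the diagnosed category is merged into one placeholder label, κ is recomputed on the resulting two-way distinction, and the overall κ is subtracted. The names `collapse_except` and `category_diagnostic`, the placeholder label `REST` and the toy data are assumptions.

```python
from collections import Counter


def fleiss_kappa(items):
    """Fleiss' kappa; `items` is a list of per-sentence label lists."""
    n_raters = len(items[0])
    pooled = Counter()
    agreement_sum = 0.0
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )
    p_a = agreement_sum / len(items)
    total = len(items) * n_raters
    p_e = sum((c / total) ** 2 for c in pooled.values())
    return (p_a - p_e) / (1.0 - p_e)


def collapse_except(items, category, rest="REST"):
    """Keep `category`; merge every other label into the single label `rest`."""
    return [[lab if lab == category else rest for lab in labels]
            for labels in items]


def category_diagnostic(items, category):
    """Kappa for `category` vs. rest, minus kappa over all categories.

    The category must occur at least once, otherwise the collapsed
    data contain a single label and kappa is undefined.
    """
    return fleiss_kappa(collapse_except(items, category)) - fleiss_kappa(items)


# Invented data: AIM is always agreed on, USE and FUT are confused.
toy = [
    ["AIM", "AIM", "AIM"],
    ["AIM", "AIM", "AIM"],
    ["USE", "USE", "FUT"],
    ["FUT", "USE", "FUT"],
    ["USE", "FUT", "USE"],
    ["FUT", "FUT", "USE"],
]
for cat in ("AIM", "USE", "FUT"):
    print(cat, round(category_diagnostic(toy, cat), 2))
```

In this toy example the diagnostic is positive for AIM and negative for USE and FUT, mirroring how positive and negative entries are to be read in the figure.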

The results confirm that categories USE, AIM, 
OWN_MTHD, OWN_RES and FUT are particularly 
well distinguished in both disciplines. This is a 
positive result, as these categories are important 
for several types of searches. In these cases the 
guidelines seem to fully suffice for their descrip- 
tion, but then again good performance of AIM, 
FUT and USE is not that surprising, as they are
signalled clearly by linguistic and non-linguistic 
cues. However, there are three categories with 
particularly low distinguishability in both disci- 
plines: ANTISUPP, OWN_FAIL and PREV_OWN. 
As ANTISUPP and OWN_FAIL are crucial for the 
envisaged downstream tasks, the problems with 
their definition should be identified and fixed. We 
are in the process of systematically troubleshoot- 
ing the guidelines for those categories. 

The table also shows that category definition 
has discipline-specific problems. For instance, 
we believe that the poor distinctiveness of 
OWN_FAIL for CL is due to the 
fact that we encountered only very few potential 
OWN_FAIL cases in this domain. The definitions 
of the categories SUPPORT and NOV_ADV also 
seem to be substantially more confusing for CL 
than for chemistry. However, CODI is a category 
which shows average distinctiveness for CL, but 
much worse distinctiveness for chemistry. We be- 
lieve this is due to the fact that comparisons of 
methods and approaches are more common in CL 
and are clearly expressed, whereas in chemistry 
the objects that are involved in comparisons are 
more varied and at a lower level of abstraction 
(e.g., compounds, properties of compounds, coef- 
ficients, etc.), which obviously has a negative ef- 
fect on the distinctiveness of this category. 

⁵ All κ values for chemistry were measured with N=3745, 
n=2, k=3; for CL with N=1629, n=2, k=3. 


[Figure 4 table: columns Category, Chem, CL. Rows: USE, 
AIM, OWN_MTHD, OWN_RES, FUT, ANTISUPP, CODI, 
CO_GRO, SUPPORT, OTHR, NOV_ADV, OWN_CONC, 
GAP_WEAK, PREV_OWN, OWN_FAIL. Most cell values did 
not survive extraction. Surviving values (column unclear): 
NOV_ADV -0.07, OWN_CONC -0.08, GAP_WEAK -0.08, 
PREV_OWN -0.11; unattributed: -0.19, -0.35, -0.36.] 


Figure 4: Krippendorff’s Diagnostics for Category 
Distinction (κ, relative to Overall Distinctiveness). 


We also provide a direct comparison of our an- 
notation results with those from the original AZ 
scheme. Comparisons between two similar anno- 
tation schemes can be made by collapsing those 
categories in each scheme which are not distin- 
guished in the other scheme. Such a comparison 
can of course only ever approximate the smallest 
common denominator between two schemes. 

The AZ-II categories were collapsed into a set 
of six categories that closely resemble AZ cate- 
gories, as described in section 2 (with OWN simu- 
lated by the union of OWN_FAIL, OWN_MTHD, 
OWN_RES, OWN_CONC, FUT, and NOV_ADV). 
This created a 6-category AZ annotation. 

As TEXTUAL is not marked up in AZ-II, the 
original AZ annotation was also collapsed, by in- 
corporating TEXTUAL examples into OWN. The 
two 6-pronged AZ-annotations are now more di- 
rectly comparable. Inter-annotator agreement for 
the collapsed AZ-II showed κ = 0.75 (N=3745, 
n=6, k=3). This compares favourably to the col- 
lapsed AZ’s agreement of κ = 0.71 (N=3420, n=6, 
k=3); but when comparing the raw numerical re- 
sults one should consider that different data from 
different disciplines is used (chemistry in AZ-II, 
CL in AZ). 
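
The collapsing step can be sketched as follows. Only the OWN union and the merging of TEXTUAL into OWN are taken from the text; the function names are hypothetical, and all other AZ-II labels are passed through unchanged unless a mapping is supplied, since their correspondence is given in section 2 rather than here.

```python
# AZ-II categories whose union simulates the AZ category OWN (from the text).
OWN_UNION = frozenset(
    {"OWN_FAIL", "OWN_MTHD", "OWN_RES", "OWN_CONC", "FUT", "NOV_ADV"}
)


def collapse_az2(label, extra_mapping=None):
    """Collapse one AZ-II label (hypothetical helper).

    extra_mapping: optional dict for the remaining AZ-II labels; it must be
    supplied by the caller, because that mapping is not restated here.
    """
    if label in OWN_UNION:
        return "OWN"
    return (extra_mapping or {}).get(label, label)


def collapse_az(label):
    """Collapse one original AZ label: TEXTUAL is folded into OWN."""
    return "OWN" if label == "TEXTUAL" else label
```

Applying both functions to the respective annotations yields two label sequences over comparable category sets, on which agreement can then be measured.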

These results should be interpreted as positive 
evidence for the domain-independence of AZ, 
and also for the feasibility of using trained non- 
experts as annotators. The additional work that 
went into the guidelines has produced annotation 
of a high consistency, even though AZ-II provides 
more distinctions (15 categories vs. 7 in AZ). 


There is also the faint possibility that discourse 
annotation of chemistry is intrinsically easier than 
discourse annotation of CL, because it is a more 
established discipline, and not despite it. For 
instance, it is likely that the problem-solving cat- 
egories OWN_FAIL, OWN_MTHD, OWN_RES and 
OWN_CONC are easier to describe in a discipline 
with an established methodology (such as chem- 
istry), than they are in a younger, developing dis- 
cipline such as computational linguistics. 


5 Conclusion 


Argumentative Zoning is an analysis of the rhetor- 
ical progression of the scientific argument in a pa- 
per. In this paper, we have made the following 
contributions to this analysis: 

• We have presented a more informative 
scheme, which additionally recognises the 
structure of an experiment in terms of prob- 
lem solving (method — results — conclusions) 
and makes more fine-grained distinctions in 
some of the sentiment-inspired relational cat- 
egories (e.g., criticism and comparisons to 
other approaches). 

• We have introduced an annotation methodology 
which attempts to systematically exclude the 
use of annotators’ extraneous domain knowl- 
edge from the annotation. 

• We have experimentally shown that human 
coders can independently annotate this new 
AZ scheme in two distinct disciplines. Our 
results show inter-annotator agreements of 
κ=0.65 and κ=0.71 for computational lin- 
guistics and chemistry, respectively. 


Overall, the outcome of this work indicates 
that the phenomena described in AZ can be de- 
fined in a domain-independent way. The experi- 
ment also tested how realistic the “expert-trained 
non-expert” approach to domain-knowledge free 
annotation is. The fact that the agreement be- 
tween three annotators (an expert, a semi-expert, 
and a non-expert) is acceptable overall vindicates 
our task definition as domain-knowledge free (us- 
ing the tools of justification and domain-specific 
generic knowledge). However, the agreements in- 
volving the semi-expert are higher than the agree- 
ment between expert and non-expert. This prob- 
ably means that the chemistry generics were not 
fully adequate to ensure that the non-expert un- 
derstood enough of the chemistry to achieve the 
highest possible agreement. 

The automation of AZ-annotation is underway. 
This requires adaptation of the high-level features 
used in AZ (Teufel and Moens, 2002) to chemistry. 
We are also preparing an annotation experiment 
with naive annotators. Another research avenue 
is the expansion of the guidelines to other disci- 
plines such as bio-medicine, and to longer journal 
articles, e.g., in computational linguistics. 


Appendix: Annotation Examples⁶ 


AIM | We now describe in this paper a synthetic route for the 
functionalisation of the framework of mesoporous organosil- 
ica by free phosphine oxide ligands, which can act as a tem- 
plate for the introduction of lanthanide ions. (b514878b) 


AIM | The aim of this paper is to examine the role that train- 
ing plays in the tagging process ... (9410012) 


NOV_ADV | Moreover, the simplicity and ease of application 
of the electrochemical method [...] should also be emphasised 
and makes it an interesting and valuable synthetic tool. 
(b513402a) 


NOV_ADV | Other than the economic factor, an important ad- 
vantage of combining morphological analysis and error detec- 
tion/correction is the way the lexical tree associated with the 
analysis can be used to determine correction possibilities. 
(9504024) 


CO_GRO | A wide range of organosulfur compounds are bi- 
ologically active and some find commercial application as 
fungicides and bactericides. (b514441h) 


CO_GRO | It has often been stated that discourse is an inher- 
ently collaborative process ... (9504007) 


OTHR | In their system, antibody immobilized on a solid sub- 
strate reacts with antigen, which binds with another antibody 
labelled with peroxidase. (b313094k) 


OTHR | But in Moortgat’s mixed system all the different re- 
source management modes of the different systems are left in- 
tact in the combination and can be exploited in different parts 
of the grammar. (9605016) 


PREV_OWN | As a program aimed at the applications of 
imines, we have studied the formation of carbanions 
from imines and their subsequent reactions. (b200198e) 


PREV_OWN | Earlier work of the author (Feldweg 1993; 
Feldweg 1995a) within the framework of a project on corpus 
based development of lexical knowledge bases (ELWIS) has 
produced LIKELY... (9502038) 


OWN_MTHD | In order for it to be useful for our purposes, 
the following extensions must be made: (0102021) 


OWN_MTHD | On the other hand, a tertiary amide can be an 
excellent linking functional group. (b201987f) 


⁶ Corpus examples are taken from our chemistry and CL 
data sets, indicated by their respective file numbers. 


OWN_FAIL | Initial attempts to improve the dehydration of 4 
via chemical or thermal means were unsuccessful; similarly, 
attempts to couple the chlorosilane (Me3Si)2(Me2ClSi)CH 
with Ag2O failed. (b510692c) 


OWN_FAIL | When the ABL algorithms try to learn with two 
completely distinct sentences, nothing can be learned. 
(0104006) 


OWN_RES | While the acid 1a readily coupled to the olefin, 
the corresponding boronic ester was surprisingly inert under 
the reaction conditions. (b311492a) 


OWN_RES | All the curves have a generally upward trend but 
always lie far below backoff (51% error rate). (0001012) 


OWN_CONC | It is unlikely that every VOC emitted by plants 
serves an ecological or physiological role ... (b507589k) 


OWN_CONC | [...] implausible, it appears that in fact the major problems do not 
lie in the area of grammar size, but in input length. (9405033) 


GAP_WEAK | Various methods of preparation have been de- 
veloped, but they often suffer from low yield and tedious 
separation. [16,17,28,31] (b200888m) 


GAP_WEAK | Here, we will produce experimental evidence 
suggesting that this simple model leads to serious overesti- 
mates of system error rates... (9407009) 


CODI | However, the measured values of the dielectric con- 
stant (ε = 310) are lower than the values reported by Ganguli 
and coworkers for BSTO pellets sintered at 1100 degC... 
(b506578)) 


CODI | Unlike most research in pragmatics that focuses on 
certain types of presuppositions or implicatures, we provide a 
global framework in which one can express all these types of 
pragmatic inferences. (9504017) 


SUPPORT | This is in line with the findings of Martin and Illas 
(84,85) for inorganic solids (b515732c) 


SUPPORT | Work similar to that described here has been car- 
ried out by Merialdo (1994), with broadly similar conclusions. 
(9410012) 


USE | The diamine 10 was prepared following a previously 
published procedure, (b110865b) 


USE | We use the framework for the allocation and transfer 
of control of Whittaker and Stenton (1988). (9504007) 


FUT | Our further efforts are directed towards the above 
goal,... and overcoming limitations pertaining to the electron- 
poor arylboronic acids. (b311492a) 


FUT | An important area for future research is to develop 
principled methods for identifying distinct speaker strategies 
pertaining to how they signal segments. (9505025) 


ANTISUPP | Although purification of Sb to a de of 95percent 
has been reported elsewhere, in our hands it was always 
obtained as a mixture of the two [EQN]-diastereomers. 
(b310767a) 


ANTISUPP | This result challenges the claims of recent dis- 
course theories (Grosz and Sidner 1986, Reichman 1985) 
which argue for a the close relation between cue words and 
discourse structure. (9504006) 